49 research outputs found

    Parallel-Architecture Simulator Development Using Hardware Transactional Memory

    To address the need for a simpler parallel programming model, Transactional Memory (TM) has been developed and promises good parallel performance with easy-to-write parallel code. Unlike lock-based approaches, with TM, programmers do not need to explicitly specify and manage the synchronization among threads. Instead, programmers simply mark code segments as transactions, and the TM system manages the concurrency control for them. TM can be implemented either in software (STM) or hardware (HTM). STMs are more flexible but suffer from serious performance overheads, whereas HTMs are faster but limited by hardware space constraints. We present an implementation of an HTM system, based on an existing protocol (Scalable-TCC), over a full-system simulator. We provide a memory system that allows for a configurable number of cache entries, associativity, cache-line size, and all the access timings in the memory hierarchy, combined with a powerful statistics system that provides all the necessary information to draw conclusions from the transactional executions. We evaluate our HTM system using applications that cover a wide range of transactional behaviours and demonstrate that it scales efficiently up to 32 processors.

    Techniques to improve concurrency in hardware transactional memory

    Transactional Memory (TM) aims to make shared-memory parallel programming easier by abstracting away the complexity of managing shared data. The programmer defines sections of code, called transactions, which the TM system guarantees will execute atomically and in isolation from the rest of the system. The programmer is not required to implement such behaviour, as happens in traditional mutual exclusion techniques like locks; that responsibility is delegated to the underlying TM system. In addition, transactions can exploit parallelism that would not be available in mutual exclusion techniques; this is achieved by allowing optimistic execution, assuming no other transaction operates concurrently on the same data. If that assumption holds, the transaction commits its updates to shared memory at the end of its execution; otherwise, a conflict occurs and the TM system may abort one of the conflicting transactions to guarantee correctness. The aborted transaction then rolls back its local updates and is re-executed. Hardware and software implementations of TM have been studied in detail. However, large-scale adoption of software-only approaches has long been hindered by severe performance limitations. In this thesis, we focus on identifying and solving hardware transactional memory (HTM) issues in order to improve concurrency and scalability. Two key dimensions determine the HTM design space: conflict detection and speculative version management. The former determines how conflicts between concurrent transactions are detected and how they are resolved. The latter defines where transactional updates are stored and how the system deals with two versions of the same logical data. This thesis proposes a flexible mechanism that allows efficient storage of and access to two versions of the same logical data, improving overall system performance and energy efficiency.
Additionally, in this thesis we explore two solutions to reduce system contention (circumstances where transactions abort due to data dependencies) in order to improve the concurrency of HTM systems. The first mechanism provides a suitable design for applying prefetching to speed up transaction execution, shortening the window of time in which such transactions can experience contention. The second is an accurate abort prediction mechanism able to identify, before a transaction's execution, potential conflicts with running transactions. This mechanism uses the past behaviour of transactions and locality in memory references to make predictions, adapting to variations in workload characteristics. We demonstrate that this mechanism is able to manage contention efficiently in single-application and multi-application scenarios. Finally, this thesis also analyses the initial real-world HTM protocols that recently appeared in market products. These protocols have been designed to be simple and easy to incorporate in existing chip-multiprocessors. However, this simplicity comes at the cost of severe performance degradation due to transient and persistent livelock conditions, potentially preventing forward progress. We show that existing techniques are unable to mitigate this degradation effectively. To deal with this issue, we propose a set of techniques that retain the simplicity of the protocol while providing improved performance and forward progress guarantees in a wide variety of transactional workloads.
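
The optimistic execute, detect conflict, abort and re-execute cycle described in this abstract can be illustrated with a small software sketch. This is not the thesis's HTM design: it is a hypothetical Python emulation in which a single versioned cell stands in for shared data, a version check at commit time stands in for conflict detection, and the names `VersionedCell` and `atomically` are assumptions introduced only for illustration.

```python
import threading


class VersionedCell:
    """A shared value plus a version counter bumped on every commit (hypothetical)."""

    def __init__(self, value):
        self.value = value
        self.version = 0


# One global lock guards only the short snapshot and commit steps,
# not the speculative work itself.
_commit_lock = threading.Lock()


def atomically(cell, update):
    """Optimistically apply update(old) -> new to cell.

    Returns the number of aborts (re-executions) that were needed.
    """
    aborts = 0
    while True:
        # Begin: take a consistent snapshot of the data and its version.
        with _commit_lock:
            seen_version, seen_value = cell.version, cell.value

        # Speculative execution on a private copy; nothing is published yet.
        new_value = update(seen_value)

        # Commit: publish only if no other transaction committed meanwhile.
        with _commit_lock:
            if cell.version == seen_version:
                cell.value = new_value
                cell.version += 1
                return aborts

        # Conflict detected: discard the private result and re-execute.
        aborts += 1
```

Calling `atomically(cell, lambda v: v + 1)` from several threads never loses an increment, because a transaction whose snapshot has gone stale is discarded and retried instead of being committed.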

    Design trade-offs for emerging HPC processors based on mobile market technology

    This is a post-peer-review, pre-copyedit version of an article published in The Journal of Supercomputing. The final authenticated version is available online at: http://dx.doi.org/10.1007/s11227-019-02819-4
    High-performance computing (HPC) is at the crossroads of a potential transition toward mobile market processor technology. Unlike in prior transitions, numerous hardware vendors and integrators will have access to state-of-the-art processor designs due to Arm’s licensing business model. This fact gives them greater flexibility to implement custom HPC-specific designs. In this paper, we undertake a study to quantify the different energy-performance trade-offs when architecting a processor based on mobile market technology. Through detailed simulations over a representative set of benchmarks, our results show that: (i) a modest amount of last-level cache per core is sufficient, leading to significant power and area savings; (ii) in-order cores offer favorable trade-offs when compared to out-of-order cores for a wide range of benchmarks; and (iii) heterogeneous configurations help to improve processor performance and energy efficiency. Peer Reviewed. Postprint (author's final draft)

    Implications of non-volatile memory as primary storage for database management systems

    Traditional Database Management System (DBMS) software relies on hard disks for storing relational data. Hard disks are cheap, persistent, and offer huge storage capacities. However, data retrieval latency for hard disks is extremely high. To hide this latency, DRAM is used as intermediate storage. DRAM is significantly faster than disk, but it is deployed in smaller capacities due to cost and power constraints and lacks the persistence that disks provide. Non-Volatile Memory (NVM) is an emerging storage class technology which promises the best of both worlds. It can offer large storage capacities, due to better scaling and cost metrics than DRAM, and is non-volatile (persistent) like hard disks. At the same time, its data retrieval time is much lower than that of hard disks and it is also byte-addressable like DRAM. In this paper, we explore the implications of employing NVM as primary storage for DBMS. In other words, we investigate the modifications that must be applied to a traditional relational DBMS to take advantage of NVM features. As a case study, we have modified the storage engine (SE) of PostgreSQL, enabling efficient use of NVM hardware. We detail the necessary changes and the challenges such modifications entail, and evaluate them using a comprehensive emulation platform. Results indicate that our modified SE reduces query execution time by up to 40% and 14.4% when compared to disk and NVM storage, with average reductions of 20.5% and 4.5%, respectively. The research leading to these results has received funding from the European Union’s 7th Framework Programme under grant agreement number 318633, the Ministry of Science and Technology of Spain under contract TIN2015-65316-P, and a HiPEAC collaboration grant awarded to Naveed Ul Mustafa. Peer Reviewed. Postprint (author's final draft)

    Efficient direct convolution using long SIMD instructions

    This paper demonstrates that state-of-the-art proposals to compute convolutions on architectures with CPUs supporting SIMD instructions deliver poor performance for long SIMD lengths due to frequent cache conflict misses. We first discuss how to adapt the state-of-the-art SIMD direct convolution to architectures using long SIMD instructions and analyze the implications of increasing the SIMD length on the algorithm formulation. Next, we propose two new algorithmic approaches: the Bounded Direct Convolution (BDC), which adapts the amount of computation exposed to mitigate cache misses, and the Multi-Block Direct Convolution (MBDC), which redefines the activation memory layout to improve the memory access pattern. We evaluate BDC, MBDC, the state-of-the-art technique, and a proprietary library on an architecture featuring CPUs with 16,384-bit SIMD registers using ResNet convolutions. Our results show that BDC and MBDC achieve respective speed-ups of 1.44× and 1.28× compared to the state-of-the-art technique for ResNet-101, and 1.83× and 1.63× compared to the proprietary library. This work receives EuroHPC-JU funding under grant no. 101034126, with support from the Horizon2020 program. Adrià Armejach is a Serra Hunter Fellow and has been partially supported by the Grant IJCI-2017-33945 funded by MCIN/AEI/10.13039/501100011033. Marc Casas has been partially supported by the Grant RYC-2017-23269 funded by MCIN/AEI/10.13039/501100011033 and ESF Investing in your future. This work is supported by the Spanish Ministry of Science and Technology through the PID2019-107255GB project and the Generalitat de Catalunya (contract 2017-SGR-1414). Peer Reviewed. Postprint (author's final draft)

    Porting and optimizing BWA-MEM2 using the Fujitsu A64FX processor

    Sequence alignment pipelines for human genomes are an emerging workload that will dominate in the precision medicine field. BWA-MEM2 is a tool widely used in the scientific community to perform read mapping studies. In this paper, we port BWA-MEM2 to the AArch64 architecture using the ARMv8-A specification, and we compare the resulting version against an Intel Skylake system both in performance and in energy-to-solution. The porting effort entails numerous code modifications, since BWA-MEM2 implements certain kernels using x86_64-specific intrinsics, e.g., AVX-512. To adapt this code we use Arm’s recently introduced Scalable Vector Extension (SVE). More specifically, we use Fujitsu’s A64FX processor, the first to implement SVE. The A64FX powers the Fugaku supercomputer, which led the Top500 ranking from June 2020 to November 2021. After porting BWA-MEM2, we define and implement a number of optimizations to improve performance on the A64FX target architecture. We show that while the A64FX performance is lower than that of the Skylake system, the A64FX delivers 11.6% better energy-to-solution on average. All the code used for this article is available at https://gitlab.bsc.es/rlangari/bwa-a64fx

    Design Space Exploration of Next-Generation HPC Machines

    The landscape of High Performance Computing (HPC) system architectures keeps expanding with new technologies and increased complexity. With the goal of improving the efficiency of next-generation large HPC systems, designers require tools for analyzing and predicting the impact of new architectural features on the performance of complex scientific applications at scale. We simulate five hybrid (MPI+OpenMP) applications over 864 architectural proposals based on state-of-the-art and emerging HPC technologies, relevant both in industry and research. This paper significantly extends our previous work with the MUltiscale Simulation Approach (MUSA), enabling accurate performance and power estimations of large-scale HPC systems. We reveal that several applications present critical scalability issues, mostly due to the software parallelization approach. Looking at speedup and energy consumption while exploring the design space (i.e., changing memory bandwidth, number of cores, and type of cores), we provide evidence-based architectural recommendations that will serve as hardware and software co-design guidelines. Preprint

    Commit on overflow

    Current commercial CPUs have hardware support for speculative lock elision (SLE). SLE tries to elide the lock by speculatively executing the lock-protected critical section. If the speculation fails, SLE acquires the lock and re-executes the critical section non-speculatively. The latest Intel CPUs implement SLE and hardware transactional memory (HTM), where SLE uses HTM transactions to speculatively execute critical sections. HTM only supports bounded-size transactions, where non-conflicting transactions execute until they overflow and abort. Bounded-size transactions impose a limit on the size of SLE-protected critical sections. Even worse, the current SLE implementation executes large non-conflicting critical sections twice: first speculatively in a transaction, and then non-speculatively by acquiring the lock at the beginning of the critical section. Ideally, SLE should execute all non-conflicting critical sections exactly once. This paper introduces a commit on overflow (COO) transaction abort policy which, instead of aborting, commits the overflowed transaction and continues executing it. We show the usefulness of COO when executing large SLE-protected critical sections. We also show that our COO implementation preserves the atomicity of SLE-protected critical sections. Postprint (published version)

    A BF16 FMA is all you need for DNN training

    Fused Multiply-Add (FMA) functional units constitute a fundamental hardware component to train Deep Neural Networks (DNNs). Their silicon area grows quadratically with the mantissa bit count of the computer number format, which has motivated the adoption of the BrainFloat16 format (BF16). BF16 features 1 sign, 8 exponent and 7 explicit mantissa bits. Some approaches to train DNNs achieve significant performance benefits by using the BF16 format. However, these approaches must combine BF16 with the standard IEEE 754 Floating-Point 32-bit (FP32) format to achieve state-of-the-art training accuracy, which limits the impact of adopting BF16. This article proposes the first approach able to train complex DNNs entirely using the BF16 format. We propose a new class of FMA operators, FMAbf16 n m, that entirely rely on BF16 FMA hardware instructions and deliver the same accuracy as FP32. FMAbf16 n m operators achieve performance improvements within the 1.28-1.35× range on ResNet101 with respect to FP32. FMAbf16 n m enables training complex DNNs on simple low-end hardware devices without requiring expensive FP32 FMA functional units. Marc Casas was partially supported under Grant RYC-2017-23269 funded by MCIN/AEI/10.13039/501100011033, and by ESF Investing in your future. Adrià Armejach is a Serra Hunter Fellow and has been partially supported under Grant IJCI-2017-33945 funded by MCIN/AEI/10.13039/501100011033. John Osorio has been partially supported under Grant PRE2019-090406 funded by MCIN/AEI/10.13039/501100011033 and by ESF Investing in your future. This work has been partially supported by Intel under the BSC-Intel collaboration and European Union Horizon 2020 research and innovation programme under Grant 955606 - DEEP-SEA EU project. Peer Reviewed. Postprint (author's final draft)
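
The BF16 layout stated in this abstract (1 sign bit, 8 exponent bits, 7 explicit mantissa bits) matches the upper 16 bits of an FP32 encoding, which the following minimal sketch illustrates. It is not taken from the article: the helper names are hypothetical, and the conversion simply truncates the low 16 bits (round toward zero), which is an assumption made for brevity; real hardware commonly rounds to nearest even.

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Return a 16-bit BF16 pattern for x by truncating its FP32 encoding.

    Assumption: truncation (round toward zero) is used for simplicity.
    """
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16


def bf16_bits_to_float(bits16: int) -> float:
    """Widen a BF16 pattern back to a float by zero-filling the low 16 bits."""
    (value,) = struct.unpack(">f", struct.pack(">I", (bits16 & 0xFFFF) << 16))
    return value


def bf16_fields(bits16: int) -> tuple:
    """Split a BF16 pattern into (sign, exponent, mantissa) fields."""
    sign = (bits16 >> 15) & 0x1       # 1 sign bit
    exponent = (bits16 >> 7) & 0xFF   # 8 exponent bits, biased by 127 as in FP32
    mantissa = bits16 & 0x7F          # 7 explicit mantissa bits
    return sign, exponent, mantissa
```

Because BF16 keeps the full 8-bit FP32 exponent, it preserves FP32's dynamic range and gives up only mantissa precision: with this truncating conversion, for example, 3.14159 becomes 3.140625.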